Jasper Slingsby
“Replication is the ultimate standard by which scientific claims are judged.” - Peng (2011)
Sadly, we have a problem…
‘Is there a reproducibility crisis?’ - A survey of >1500 scientists (Baker 2016; Penny 2016).
Let’s start being more specific about our miracles… Cartoon © Sidney Harris. Used with permission ScienceCartoonsPlus.com
“Five selfish reasons to work reproducibly” (Markowetz 2015)
Some less selfish reasons (and relevant for ecoforecasting):
It speeds progress in science by allowing you (or others) to rapidly build on previous findings and analyses
It allows easy comparison of new analytical approaches to older ones
It makes it easy to repeat analyses on new data, e.g. for ecological forecasting or LTER (Long Term Ecological Research)
The tools used are useful beyond Reproducible Research, e.g. building websites
Reproducible research skills are highly sought after!
Barriers to working reproducibly, from “A Beginner’s Guide to Conducting Reproducible Research” (Alston and Rick 2021):
1. Complexity
2. Technological change
3. Human error
4. Intellectual property rights
‘Data Pipeline’ from xkcd.com/2054, used under a CC-BY-NC 2.5 license.
Working reproducibly requires careful planning and documentation of each step in your scientific workflow from planning your data collection to sharing your results.
These steps entail overlapping/intertwined components, namely:
This is a big topic in itself and has a separate section in my notes. I encourage you to read the notes, as this is important information for you to know and the content is still examinable, although I will not expect you to know it in as much detail.
Data loss is the norm… Good data management is key!!!
The ‘Data Decay Curve’ (Michener et al. 1997)
The Data Life Cycle, adapted from https://www.dataone.org/
Good data management begins with planning. You essentially outline the plan for every step of the cycle in as much detail as possible.
Fortunately, there are online data management planning tools that make it easy to develop a Data Management Plan (DMP).
Screenshot of UCT’s Data Management Planning Tool’s Data Management Checklist.
A DMP is a living document and should be regularly revised during the life of a project!
I would argue that it is foolish to collect data without doing quality assurance and quality control (QA/QC) as you go, irrespective of how you are collecting the data.
An example data collection app I built in AppSheet that allows you to log GPS coordinates, take photos, record various fields, etc.
There are many tools that allow you to do some quality assurance and quality control as you collect the data.
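As a rough illustration of QA/QC as you go, the sketch below flags suspect records as soon as they are entered rather than at analysis time. The column names (`plot_id`, `latitude`, `height_cm`) and the valid ranges are assumptions made up for this example, not part of any particular data collection app.

```r
# Hypothetical field records; column names and valid ranges are assumptions.
records <- data.frame(
  plot_id   = c("A1", "A2", "A3", "A4"),
  latitude  = c(-33.95, -34.10, 33.95, -33.90),  # third value has the wrong sign
  height_cm = c(12.5, NA, 40.0, -3.0)            # one missing and one negative value
)

# Flag rows that fail basic checks instead of silently dropping them.
check_records <- function(df) {
  # Assumed study region: latitudes between -35 and -22 degrees
  df$flag_latitude <- is.na(df$latitude) | df$latitude < -35 | df$latitude > -22
  # Heights must be present and positive
  df$flag_height   <- is.na(df$height_cm) | df$height_cm <= 0
  df$needs_review  <- df$flag_latitude | df$flag_height
  df
}

checked <- check_records(records)

# Show only the records that need a second look
checked[checked$needs_review, c("plot_id", "latitude", "height_cm")]
```

Flagging rather than deleting keeps the raw record intact, so the correction itself can be documented.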
“The fun bit”, but again, there are many things to bear in mind and keep track of so that your analysis is repeatable. This is largely covered by the sections on “Coding and code management” and “Computing environment and software” below.
Artwork @allison_horst
Project files and folders can get unwieldy fast and really bog you down!
The main considerations are:
Most projects have similar requirements
Here’s how I usually manage my folders:
“Point-and-click” software like Excel, Statistica, etc. may seem easier, but you’ll regret it in the long run…
Coding is communication. Messy code is bad communication. Bad communication hampers collaboration and makes it easier to make mistakes…
Streamline, collaborate, reuse, contribute, and fail safely…
It’seasytowritemessyindecipherablecode!!! - Write code for people, not computers!!!
Check out the Tidyverse style guide for R-specific guidance, but here are some basics:
#Header indicating purpose, author, date, version etc
#Define settings and load required libraries
#Read in data
#Wrangle/reformat/clean/summarize data as required
#Run analyses (often multiple steps)
#Wrangle/reformat/summarize analysis outputs for visualization
#Visualize outputs as figures or tables

Version control tools can be challenging, but also hugely simplify your workflow!
The advantages of version control:
Code is shared through repositories (“repos”) or gists (code snippets)
You can work offline by cloning the repo to your local PC. You can “push to” or “pull from” the online repo to keep versions in sync.
Changes are tracked through commits
Changes are committed with a commit message. Each commit is a recoverable version that can be compared or reverted to
Others can build on your code by forking repos and working on their own branch.
Changes are proposed through pull requests
Repo owners can accept and integrate changes seamlessly by reviewing and merging the forked branch back into the main branch
Commits and pull requests provide a written record of changes and track the user, date, time, etc., all of which are useful for tracking mistakes and assigning blame when things go wrong
You can assign, log and track issues and feature requests
This should all make more sense after the practical, but here are some pretty pictures to drive some of this home…
Artwork by @allison_horst CC-BY-4.0
Artwork by @allison_horst CC-BY-4.0
Sharing your code and data is not enough to maintain reproducibility…
Software and hardware change with upgrades, versions or user community preferences!
The simple solution is to carefully document the hardware and versions of software used so that others can recreate that computing environment if needed.
In R, you can do this with the sessionInfo() function, giving details like so:
R version 4.2.3 (2023-03-15)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.0
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_3.4.1
loaded via a namespace (and not attached):
[1] rstudioapi_0.14 knitr_1.42 magrittr_2.0.3 tidyselect_1.2.0
[5] munsell_0.5.0 colorspace_2.1-0 R6_2.5.1 rlang_1.1.0
[9] fastmap_1.1.1 fansi_1.0.4 dplyr_1.1.2 tools_4.2.3
[13] grid_4.2.3 gtable_0.3.1 xfun_0.37 utf8_1.2.3
[17] cli_3.6.0 withr_2.5.0 htmltools_0.5.4 yaml_2.3.7
[21] digest_0.6.31 tibble_3.2.1 lifecycle_1.0.3 farver_2.1.1
[25] RColorBrewer_1.1-3 vctrs_0.6.3 glue_1.6.2 evaluate_0.20
[29] rmarkdown_2.20 labeling_0.4.2 compiler_4.2.3 pillar_1.9.0
[33] generics_0.1.3 scales_1.2.1 jsonlite_1.8.4 pkgconfig_2.0.3
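One way to keep this record with a project is to write the output to a text file every time the analysis runs. The sketch below is a minimal example using base R only; the file name and its location in a temporary directory are arbitrary choices for illustration, and in a real project you would likely save it next to your outputs.

```r
# Save the session details to a text file.
# tempdir() is used here only so the sketch runs anywhere;
# a project folder would be the more useful location.
session_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), session_file)

# Individual versions can also be queried, e.g. for a methods section.
r_version     <- R.version.string
stats_version <- as.character(packageVersion("stats"))
```

Committing this file along with your code gives each version of the analysis a matching record of the environment it ran in.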
A better solution is to use containers like Docker or Singularity.
These are contained, lightweight computing environments similar to virtual machines, that you can package with your software/workflow.
You set your container up to have everything you need to run your workflow (and nothing extra), so anyone can download (or clone) your container, code and data and run your analyses perfectly first time.
This is covered in more detail in the data management lecture, but suffice to say there’s no point working reproducibly if you’re not going to share all the components necessary to complete your workflow…
Another key component here is that ideally all your data, code, publications, etc. are shared Open Access - i.e. they are not stuck behind some paywall…